Data preprocessing

Data labelling

Data preprocessing techniques for ML

1. taking care of missing data

There are several ways to handle missing data: drop the rows (or columns) that contain them, or impute the missing values, e.g. with the mean, median, or most frequent value of the column.
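For illustration, a small pandas/scikit-learn sketch of both approaches on a hypothetical DataFrame (column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical example data with missing values
df = pd.DataFrame({"age": [25, np.nan, 32, 41, np.nan],
                   "salary": [50000, 61000, np.nan, 72000, 58000]})

# Option 1: drop rows that contain missing values
df_dropped = df.dropna()

# Option 2: impute missing values, e.g. with the column mean
imputer = SimpleImputer(strategy="mean")  # "median" or "most_frequent" also work
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```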

2. handle outliers
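One common approach is clipping values outside the interquartile-range fences; this is only a sketch, assuming a numeric pandas Series:

```python
import pandas as pd

def clip_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical usage on a numeric column with one extreme value
clipped = clip_outliers_iqr(pd.Series([1.0, 2.0, 3.0, 2.5, 100.0]))
```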

3. log transform
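An illustrative sketch: np.log1p compresses a long right tail while staying defined at zero (the values below are made up):

```python
import numpy as np

skewed = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
transformed = np.log1p(skewed)  # log(1 + x): compresses the right tail, handles 0 safely
```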

4. categorical encoding = converting categorical columns to numerical columns

Challenges

  • Dummy Variable Trap: a scenario in which the dummy variables are multicollinear (one can be predicted from the others) -> so, always omit one dummy variable (see the sketch after this list)!
  • Cannot Capture Interaction Effects: fix this by creating composite features, i.e. computing explicit feature crosses (also shown below)
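Below is a minimal sketch of both fixes, assuming pandas and hypothetical columns "city", "width", and "height":

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris"],
                   "width": [2.0, 3.0, 1.5],
                   "height": [4.0, 1.0, 2.0]})

# One-hot encode and drop one dummy column to avoid the dummy variable trap
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)

# Explicit feature cross: a composite feature that captures an interaction effect
encoded["area"] = encoded["width"] * encoded["height"]
```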

In pandas, storing such columns with the category data type instead of object saves a ton of memory.
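A quick illustration of the memory difference (the column and values are hypothetical):

```python
import pandas as pd

# Hypothetical low-cardinality string column
s_object = pd.Series(["red", "green", "blue"] * 100_000)   # dtype: object
s_category = s_object.astype("category")                   # dtype: category

print(s_object.memory_usage(deep=True))    # much larger
print(s_category.memory_usage(deep=True))  # a fraction of the object version
```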

5. splitting datasets into the training and test sets
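A typical way to do this is scikit-learn's train_test_split; the feature matrix and target below are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and target vector y
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows as the test set; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```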

6. feature scaling

= a method used to normalize the range of independent variables or features of data

Two options:

  • standardization: $x' = \frac{x - \bar{x}}{\sigma}$
  • normalization (min-max scaling): $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
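Both options map directly to scikit-learn transformers; a minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [10.0]])  # one feature as a column vector

standardized = StandardScaler().fit_transform(x)  # (x - mean) / std
normalized = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)
```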
Which to choose, from The Hundred-Page Machine Learning Book

  • unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization
  • standardization is also preferred for a feature if the values this feature takes are distributed close to a normal distribution (so-called bell curve)
  • again, standardization is preferred for a feature if it can sometimes have extremely high or low values (outliers); this is because normalization will “squeeze” the normal values into a very small range
  • In all other cases, normalization is preferable.

Why use feature scaling?

It is optional, but it helps gradient descent converge and speeds up computation; it is especially useful for gradient-based optimizers, for distance-based algorithms such as k-NN and k-means, and for variance-sensitive methods such as PCA.

Important

  • don’t apply feature scaling on dummy variables!
  • fit the scaler on the training set only, then use it to transform the test set, so that test data does not leak into the scaling statistics (see the sketch below)
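A minimal sketch of the leakage-safe way to scale, assuming scikit-learn and hypothetical train/test arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test splits of one numeric feature
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [5.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics -> no leakage
```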

What is Feature Engineering then?

Feature engineering = selecting, extracting, and transforming the most relevant features from the available data to build better-performing models.
Thus, the techniques above are part of feature engineering. Besides these, there are also feature creation and feature extraction/selection.

Best practice for feature engineering